Introduction

Sometimes you want to read in data that contains different kinds of information per line. This could be a log file where some lines contain a timestamp and the log messages itself, but other lines could contain specific information about errors or stack traces.

The first thing that we need to accomplish is to read in the file into Pandas. There is a nice little trick that you can use to read almost every kind of structured data into a Pandas DataFrame. We just have to use a separator that reads in the content of a file line by line. We then have to mark each line by their meaning. We can to this by taking some Once we have the data into a DataFrame, we can use Pandas functionality to reshape the data as we need it.

Example data set

For this example, I choose a Git log output from the Sprint PetClinic project. Such a dataset look like the following:


commit 75912a06c5613a2ea1305ad4d8ad6bc4be7765ce
Author: Stephane Nicoll <snicoll@pivotal.io>
Date:   Fri Feb 17 12:30:57 2017 +0100

    Polish contribution

    Closes gh-229

commit 443d35eae23c874ed38305fbe75216339c41beaf
Author: Henri Tremblay <henri.tremblay@gmail.com>
Date:   Thu Feb 16 15:08:30 2017 -0500

    Put Ehcache back

This is a pretty challenging dataset: Albeit we have some lines that could be used as markers (like "commit:", "Author" and "Date:"), we also have multiple lines without a marker text. Luckily, there is some convention here: The first line in this particular dataset is the so called "subject" or "title line", the other lines is the "full commit message". This makes it possible to differentiate between these two kind of information as well. So the data structure for one complete Git log entry looks like this (from doc):


commit <sha1>
Author: <author>
Date:   <author date>
[empty]
<title line>
[empty]
<full commit message>
[empty]

It may seem impossible to read in that data in a Pandas DataFrame, but we'll see that it's relative easy to do once we know some basic techniques for treating these kind of semi-structured data.

Granted, this particular case is some kind of artificial because we can output the information in a Git repository in one line using the `--pretty=format:` option. But I choose that one as example because it's easy to understand. Overall, there could be other cases where the following technique could be get very handy.

Reading in the file

Let's start by reading in the given dataset. Especially take a look at the choosen sep parameter: Here, we use the newline operator \n as separator. This puts every line from our dataset file into a separate data row of a DataFrame.

The parameter names takes a array of column names that present the headers for the DataFrame. This is necessary for our specific dataset because it doesn't have any specific header at the top of the file.

We also make use of the skipinitialspace parameter. This will not delete the whitespace in front of the first characters. In our case, this is very useful because the commit's subject and message lines are prefixed with four whitespace characters. By setting skipinitialspace=False, we can ensure that we don't interpret this data wrongly as some other type of data like the commit or author line.

There is one keyword that could be useful for this task, too, but isn't used here. The skip_blank_lines parameter will not jump over lines in the given dataset, that are empty. Again, the use of this parameters depends highly on your kind if data.


In [14]:
import pandas as pd
git_log = pd.read_csv(
    '../notebooks/datasets/git_log_sample.txt',
    sep='\n',
    names=['raw'],
    skipinitialspace=False)
git_log.head()


Out[14]:
raw
0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1 Author: Dave Syer <dsyer@pivotal.io>
2 Date: Fri Jun 30 11:07:07 2017 +0100
3 Update Spring Boot and Thymeleaf versions
4 commit ffa967c94b65a70ea6d3b44275632821838d9fd3

Marking data that belongs together

Next, we need a marking that shows which entries belong together (speaking in Data Science terms: are just variables for one observation). For this, we need an entry that marks the beginning of a such a group of entries. We can use the entries that start with commit together with the loc method for this purpose.


In [15]:
git_log_with_marker = git_log.reset_index()
git_log_with_marker.head()


Out[15]:
index raw
0 0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1 1 Author: Dave Syer <dsyer@pivotal.io>
2 2 Date: Fri Jun 30 11:07:07 2017 +0100
3 3 Update Spring Boot and Thymeleaf versions
4 4 commit ffa967c94b65a70ea6d3b44275632821838d9fd3

Next, we choose a suitable text in the dataset that marks the beginning of a new group of consecutive data. In our case, these are entries that start with the text "commit". We set all other entries to None. With this


In [16]:
git_log_with_marker.loc[
      ~git_log_with_marker['raw'].str.startswith('commit '), 
      'index'] = None
git_log_with_marker.head()


Out[16]:
index raw
0 0.0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1 NaN Author: Dave Syer <dsyer@pivotal.io>
2 NaN Date: Fri Jun 30 11:07:07 2017 +0100
3 NaN Update Spring Boot and Thymeleaf versions
4 4.0 commit ffa967c94b65a70ea6d3b44275632821838d9fd3

If we propagate the data from the former index column with a ffill() (which forward fills missing data), we get a nice column that marks entries that should belong together.


In [26]:
git_log_with_marker['index'] = git_log_with_marker['index'].ffill()
git_log_with_marker.head(10)


Out[26]:
index raw id
0 0.0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed 0.0
1 0.0 Author: Dave Syer <dsyer@pivotal.io> 0.0
2 0.0 Date: Fri Jun 30 11:07:07 2017 +0100 0.0
3 0.0 Update Spring Boot and Thymeleaf versions 0.0
4 4.0 commit ffa967c94b65a70ea6d3b44275632821838d9fd3 4.0
5 4.0 Author: Antoine Rey <antoine.rey@gmail.com> 4.0
6 4.0 Date: Wed Apr 12 21:41:00 2017 +0200 4.0
7 4.0 spring-petclinic-angular1 repo renamed to ... 4.0
8 8.0 commit fd1c742d4f8d193eb935519909c15302b783cd52 8.0
9 8.0 Author: Antoine Rey <antoine.rey@gmail.com> 8.0

The next steps are shown in the "Dissecting data" section.

The index key column approach

Alternatively, we can use also an entry that is suitable as index later on. Again, we use the entries with the commit information. Analog, we create a new column with this information. We clean up the commit columns at the same time.


In [18]:
git_log.loc[
    git_log.raw.str.startswith("commit"),
    'commit_id'] = git_log['raw'].str.replace("commit ", "")
git_log.head()


Out[18]:
raw commit_id
0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1 Author: Dave Syer <dsyer@pivotal.io> NaN
2 Date: Fri Jun 30 11:07:07 2017 +0100 NaN
3 Update Spring Boot and Thymeleaf versions NaN
4 commit ffa967c94b65a70ea6d3b44275632821838d9fd3 ffa967c94b65a70ea6d3b44275632821838d9fd3

Next, we fill all the other columns with the information about the commit.


In [19]:
git_log['commit_id'] = git_log['commit_id'].ffill()
git_log.head()


Out[19]:
raw commit_id
0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed 101c9dc69064633f697d93dcf0918bb4f74ff7ed
1 Author: Dave Syer <dsyer@pivotal.io> 101c9dc69064633f697d93dcf0918bb4f74ff7ed
2 Date: Fri Jun 30 11:07:07 2017 +0100 101c9dc69064633f697d93dcf0918bb4f74ff7ed
3 Update Spring Boot and Thymeleaf versions 101c9dc69064633f697d93dcf0918bb4f74ff7ed
4 commit ffa967c94b65a70ea6d3b44275632821838d9fd3 ffa967c94b65a70ea6d3b44275632821838d9fd3

Dissecting the data

We mark each entry by its meaning. We can achieve this by using the starting characters of each raw entry. For the commit messages, we don't have any information that could be used as marker. So we just fill in the missing information in the type entries with a "message" text in the last step.

Side note: When to use ' and when to use " ? When working directly with Pandas, the difference doesn't really matter. I use ' when I'm referencing keys or parameters, but " when I'm using text information. I'm trying to use it consinstent, though.


In [20]:
git_log.loc[git_log.raw.str.startswith("commit "), 'type'] = "commit"
git_log.loc[git_log.raw.str.startswith("Author: "), 'type'] = "author"
git_log.loc[git_log.raw.str.startswith("Date: "), 'type'] = "date"
git_log['type'] = git_log['type'].fillna('message')
git_log.head()


Out[20]:
raw commit_id type
0 commit 101c9dc69064633f697d93dcf0918bb4f74ff7ed 101c9dc69064633f697d93dcf0918bb4f74ff7ed commit
1 Author: Dave Syer <dsyer@pivotal.io> 101c9dc69064633f697d93dcf0918bb4f74ff7ed author
2 Date: Fri Jun 30 11:07:07 2017 +0100 101c9dc69064633f697d93dcf0918bb4f74ff7ed date
3 Update Spring Boot and Thymeleaf versions 101c9dc69064633f697d93dcf0918bb4f74ff7ed message
4 commit ffa967c94b65a70ea6d3b44275632821838d9fd3 ffa967c94b65a70ea6d3b44275632821838d9fd3 commit

With all lines marked by their types, we can filter out the rows that only contain the information about a commit because we copied that information in all the other rows in the previous step.


In [21]:
git_log = git_log[git_log['type'] != 'commit']
git_log.head()


Out[21]:
raw commit_id type
1 Author: Dave Syer <dsyer@pivotal.io> 101c9dc69064633f697d93dcf0918bb4f74ff7ed author
2 Date: Fri Jun 30 11:07:07 2017 +0100 101c9dc69064633f697d93dcf0918bb4f74ff7ed date
3 Update Spring Boot and Thymeleaf versions 101c9dc69064633f697d93dcf0918bb4f74ff7ed message
5 Author: Antoine Rey <antoine.rey@gmail.com> ffa967c94b65a70ea6d3b44275632821838d9fd3 author
6 Date: Wed Apr 12 21:41:00 2017 +0200 ffa967c94b65a70ea6d3b44275632821838d9fd3 date

In [22]:
git_log_data = git_log.pivot_table(
    index='commit_id',
    columns='type',
    values='raw' ,
    aggfunc=lambda x : " ".join(x))
git_log_data.head()


Out[22]:
type author date message
commit_id
024811d252f8d8218e6795d46203cff25971bc19 Author: Mic <misvy@vmware.com> Date: Thu Mar 14 18:04:36 2013 +0800 simplifying access to Integer
0504ec9fe345d9d34b15c374333f709fb147e6d6 Author: thinksh <thinkshihang@gmail.com> Date: Wed Feb 3 23:19:46 2016 -0500 Update petclinic_db_setup_mysql.txt Co...
053c84ecc95b246ef4a40fb3d4304e8908604af4 Author: Mic <misvy@vmware.com> Date: Mon Feb 3 09:31:44 2014 +0800 migrated to Spring 4.0.1
057015c14cce4791ff309419de8a8bd6339fd6e7 Author: Mic <misvy@vmware.com> Date: Fri Feb 15 15:31:04 2013 +0800 Spring MVC Test Framework and migration to...
05c1110dceeaef0626137a2f7a509add6617765b Author: Mic <misvy@vmware.com> Date: Tue Jan 15 09:29:01 2013 +0800 fixed content negotiation configuration

In [23]:
git_log_data[['author', 'email']] = git_log_data['author'].str.extract(
    "Author: (.*) <(.*)>", expand=True)
git_log_data.head()


Out[23]:
type author date message email
commit_id
024811d252f8d8218e6795d46203cff25971bc19 Mic Date: Thu Mar 14 18:04:36 2013 +0800 simplifying access to Integer misvy@vmware.com
0504ec9fe345d9d34b15c374333f709fb147e6d6 thinksh Date: Wed Feb 3 23:19:46 2016 -0500 Update petclinic_db_setup_mysql.txt Co... thinkshihang@gmail.com
053c84ecc95b246ef4a40fb3d4304e8908604af4 Mic Date: Mon Feb 3 09:31:44 2014 +0800 migrated to Spring 4.0.1 misvy@vmware.com
057015c14cce4791ff309419de8a8bd6339fd6e7 Mic Date: Fri Feb 15 15:31:04 2013 +0800 Spring MVC Test Framework and migration to... misvy@vmware.com
05c1110dceeaef0626137a2f7a509add6617765b Mic Date: Tue Jan 15 09:29:01 2013 +0800 fixed content negotiation configuration misvy@vmware.com

In [24]:
git_log_data['date'] =  git_log_data['date'].str.replace("Date: ", "")
git_log_data['date'].head()


Out[24]:
commit_id
024811d252f8d8218e6795d46203cff25971bc19      Thu Mar 14 18:04:36 2013 +0800
0504ec9fe345d9d34b15c374333f709fb147e6d6       Wed Feb 3 23:19:46 2016 -0500
053c84ecc95b246ef4a40fb3d4304e8908604af4       Mon Feb 3 09:31:44 2014 +0800
057015c14cce4791ff309419de8a8bd6339fd6e7      Fri Feb 15 15:31:04 2013 +0800
05c1110dceeaef0626137a2f7a509add6617765b      Tue Jan 15 09:29:01 2013 +0800
Name: date, dtype: object

In [25]:
git_log_data['date'] = pd.to_datetime(git_log_data['date'])
git_log_data.head()


Out[25]:
type author date message email
commit_id
024811d252f8d8218e6795d46203cff25971bc19 Mic 2013-03-14 10:04:36 simplifying access to Integer misvy@vmware.com
0504ec9fe345d9d34b15c374333f709fb147e6d6 thinksh 2016-02-04 04:19:46 Update petclinic_db_setup_mysql.txt Co... thinkshihang@gmail.com
053c84ecc95b246ef4a40fb3d4304e8908604af4 Mic 2014-02-03 01:31:44 migrated to Spring 4.0.1 misvy@vmware.com
057015c14cce4791ff309419de8a8bd6339fd6e7 Mic 2013-02-15 07:31:04 Spring MVC Test Framework and migration to... misvy@vmware.com
05c1110dceeaef0626137a2f7a509add6617765b Mic 2013-01-15 01:29:01 fixed content negotiation configuration misvy@vmware.com